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Abstract 

Background: The widespread protozoan parasite Toxoplasma gondii interferes with host cell functions by exporting 
the contents of a unique apical organelle, the rhoptry. Among the mix of secreted proteins are an expanded, 
lineage-specific family of protein kinases termed rhoptry kinases (ROPKs), several of which have been shown to be key 
virulence factors, including the pseudokinase R0P5.The extent and details of the diversification of this protein family 
are poorly understood. 

Results: In this study, we comprehensively catalogued the ROPK family in the genomes of Toxoplasma gondii, 
Neospora caninum and Eimeria tenella, as well as portions of the unfinished genome of Sarcocystis neurona, and 
classified the identified genes into 42 distinct subfamilies. We systematically compared the rhoptry kinase protein 
sequences and structures to each other and to the broader superfamily of eukaryotic protein kinases to study the 
patterns of diversification and neofunctionalization in the ROPK family and its subfamilies. We identified three ROPK 
sub-clades of particular interest: those bearing a structurally conserved N-terminal extension to the kinase domain 
(NTE), an £ fene//a-specific expansion, and a basal cluster including ROP35 and BPK1 that we term ROPKL. Structural 
analysis in light of the solved structures R0P2, R0P5, R0P8 and in comparison to typical eukaryotic protein kinases 
revealed ROPK-specific conservation patterns in two key regions of the kinase domain, surrounding a ROPK-con served 
insert in the kinase hinge region and a disulfide bridge in the kinase substrate-binding lobe. We also examined 
conservation patterns specific to the NTE-bearing clade. We discuss the possible functional consequences of each. 

Conclusions: Our work sheds light on several important but previously unrecognized features shared among rhoptry 
kinases, as well as the essential differences between active and degenerate protein kinases. We identify the most 
distinctive ROPK-specific features conserved across both active kinases and pseudokinases, and discuss these in terms 
of sequence motifs, evolutionary context, structural impact and potential functional relevance. 
By characterizing the proteins that enable these parasites to invade the host cell and co-opt its signaling mechanisms, 
we provide guidance on potential therapeutic targets for the diseases caused by coccidian parasites. 



Background 

Toxoplasma gondii is an intracellular parasite that infects 
a wide range of hosts, including an estimated one-third 
of the worlds human population [1]. The resulting dis- 
ease toxoplasmosis can be serious in pregnant women and 
immunocompromised individuals, and as an opportunis- 
tic infection associated with AIDS and cancer patients [2], 
T. gondii and its evolutionary relatives, the Coccidia, form 
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a clade of parasitic protozoa involved in many human and 
veterinary diseases such as toxoplasmosis and coccidio- 
sis. Coccidians are a lineage within the protozoan phy- 
lum Apicomplexa, which also includes the deadly malaria 
pathogen Plasmodium falciparum. Thus, T. gondii also 
serves as an experimentally tractable model organism for 
studying the shared and contrasting biological properties 
of the Apicomplexa and other intracellular parasites [3,4] . 

Apicomplexans contain a unique system of apical 
organelles called the apical complex, consisting of rhop- 
tries, micronemes and dense granules [5]. At the initia- 
tion of host cell invasion, the contents of the rhoptries 
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are injected into the host cell and the forming para- 
sitophorous vacuole which protects the intracellular par- 
asite [6]. Once there, the parasite proteins can disrupt 
host cell signaling and defense mechanisms and assist in 
recruiting host organelles [7] . 

Proteomic profiling of T. gondii rhoptries [8] and ana- 
lyis of apicomplexan genomic sequences [9-12] revealed 
that many of the proteins secreted by coccidians are pro- 
tein kinases, a class of enzymes that regulate cell signal 
transduction through phosphorylation. This expanded, 
rapidly evolving family of kinases and pseudokinases has 
been termed the rhoptry kinase (ROPK) family [10], 
or ROP2 family, in reference to a representative mem- 
ber of the family [9]. While rhoptry kinases appear to 
be unique to the Coccidia, the involvement of lineage- 
specific protein kinase families in host-parasite interac- 
tions is observed across the Apicomplexa [13]. Several 
rhoptry kinases have been shown to be involved in vir- 
ulence and alteration of host cell transcription [7,14]. 
These include ROP18, a key modulator of parasite growth 
and virulence which is localized to the parasitophorous 
vacuole membrane (PVM) [15,16], and ROP5, another 
PVM-associated protein which assists ROP18 in blocking 
the host immune response [17-21]. ROP16 localizes to the 
host cell nucleus and interacts with the STAT3 and STAT6 
immune-response signaling pathways [22-26], and ROP38 
has been implicated in the modulation of host MAPK 
signaling [10]. 

Protein kinases are a diverse family of enzymes which 
have been successfully targeted for inhibition in human 
cancers, and show promise for treating infections by pro- 
tozoan pathogens as well [27]. ATP-competitive small- 
molecule inhibitors have been developed to specifically 
target catalytically active protein kinases in parasitic pro- 
tozoa [28]. Since many of the ROPKs appear to also 
be catalytically active, there may be an opportunity to 
target these kinases for infectious diseases. However, 
the "catalytic triad" of residues considered essential for 
kinase enzymatic activity [29] is altered in about half 
of the identified ROPKs [10]. Pseudokinases have been 
observed to perform important functions in other sys- 
tems, typically through inducing allosteric changes in 
other interacting partners (e.g. [30,31]; reviewed in [32- 
34]). The overall expansion of pseudokinases in the 
ROPK family underscores observations that some cat- 
alytically inactive ROPKs nonetheless play important, 
functional roles through interaction with other pro- 
teins [18,19,35]. Structural studies showed that the pseu- 
dokinase virulence factors ROP2, ROP8 and ROP5 do 
indeed form a protein kinase fold; ROP2 and ROP8 
were indicated to be unable to bind ATP [36], while 
ROP5 was shown to bind ATP in an atypical, noncat- 
alytic conformation [37]. An interplay between ROP5, 
the active kinase ROP18 and a host immunity-related 



GTPase has been identified [18,19], demonstrating the 
potential for complex interplay between rhoptry kinases 
and the host cell signaling pathways. However, the full 
extent of the diversity in ROPK family, in terms of func- 
tion, potential interacting partners, protein structure and 
structural mechanisms, is poorly understood. With the 
availability of molecular sequence and structural data 
from multiple strains of T. gondii and related apicom- 
plexans, we can use comparative methods to examine 
the molecular evolution of ROPKs and identify functional 
shifts that may point to distinct regulatory roles and 
mechanisms. 

We catalogued the rhoptry kinases in several fully 
sequenced coccidian genomes, including Toxoplasma 
gondii, Neospora caninum, Sarcocystis neurona and Eime- 
ria tenella, and compared them to the broader eukary- 
otic protein kinase (ePK) superfamily and to each other 
to study the patterns of diversification and neofunction- 
alization in the ROPK family and its subfamilies. We 
propose previously unidentified rhoptry kinases in each 
of these genomes, including several putative new ROPK 
subfamilies. We studied the variation in these subfami- 
lies in light of the solved structures of ROP2, ROP8 and 
ROP5 proteins, and relative to "typical" eukaryotic pro- 
tein kinases. Both pseudokinases and catalytically active 
kinases appear to be prevalent throughout the ROPK 
family. We found a striking co-evolution of structural 
inserts within the canonical protein kinase domain and 
the residues that interact with them. Most noteworthy 
among these is a pattern of residues surrounding the 
ROPK- specific ckC helix in the kinase "hinge" region. 
We also recovered another pattern of co-conserved cys- 
teines that form a disulfide bond in the substrate-binding 
C-lobe. We then discuss some possible functional con- 
sequences of these distinguishing features of the ROPK 
family. 

Results 

To examine the molecular evolution and functional shifts 
in ROPKs, we used the genomic, mRNA and pro- 
teomic sequences of multiple T. gondii strains, Neospora 
caninum, Sarcocystis neurona and Eimeria tenella to 
develop profiles for 42 subfamilies of ROPK, reflecting 
orthology as well as chromosomal patterns of tandem 
repeats (see Methods). 

We used these sequence profiles to perform an analy- 
sis of evolutionary constraints, applying statistical tests of 
contrasting conservation between gene clades to identify 
potential sites of subfunctionalization and neofunction- 
alization in the ROPK family and each ROPK subfamily. 
We then mapped the sites and regions of interest onto 
solved structures of ROP2, ROP8 and ROP5 to examine 
the structural and possible functional roles these features 
may play within the parasite proteins. 
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Global trends in the ROPK family 

We used a set of HMM profiles derived from our sub- 
family sequence alignments to scan the translated gene 
model sequences available for T. gondii strains GT1, ME49 
and VEG, N. caninum and E. tenella and classify putative 
ROPK genes into the identified subfamilies. We found 37, 
55 and 38 ROPK genes in T. gondii strains GT1, ME49 and 
VEG, respectively, 44 in N. caninum and 27 in E. tenella 
(Additional file 1). The elevated ROPK counts in T. gondii 
ME49 relative to the other strains is probably due to dif- 
ferences in sequencing depth and the quality of assembly 
and gene model annotation; we also found genomic evi- 
dence of unannotated orthologs in the other strains. As 
suggested by Reese and Boyle [35], ROPK genes are often 
present in expanded loci (sites of gene duplication, usu- 
ally in tandem array) and are probably undercounted in 
annotated genomes. 

By incorporating sequences from multiple coccidian 
species into HMM profiles, we were able to identify 
several putative ROPKs that were not identified in pre- 
vious computational surveys [9,10]. These include the 
proposed subfamilies ROP47, ROP48, ROP49 and ROP50, 
present in T. gondii and N. caninum, and the E. tenella- 
specific subfamilies ROPK-Etenl, ROPK-Eten2a, ROPK- 
Eten2b, ROPK-Eten3, ROPK-Eten4, ROPK-Eten5 and 
ROPK-Eten6. We suggest these to be likely rhoptry 
kinases on the basis of sequence homology, phyloge- 
netic placement, signal peptide presence, and exist- 
ing experimental evidence. Protein or mRNA expres- 
sion has been previously observed for at least one 
member of each of these proposed subfamilies, indi- 
cating that they are not pseudogenes. ROP47, ROP49 
and ROP50 are predicted to contain a signal peptide. 
The gene coding for ROP48 has only been annotated 
in T. gondii strain ME49 (TGME49_234950, numbered 
TGME49_034950 in ToxoDB prior to version 8.0), but 
we identified genomic regions with 95% sequence iden- 
tity to this protein sequence on chromosome X of 
strains VEG and GT1 as well. Recently, a proteomics 
study observed two E. tenella proteins expressed dur- 
ing the sporozoite stage and localized in the rhoptries: 
ETH_00027700, which we assigned to the ROPK-Etenl 
subfamily, and ETH_00005190, which we assigned to the 
ROPK- Unique category [38]. A search of the available 
S. neurona ESTs and genomic scaffolds indicates that 
ROPKs are prevalent in this species as well, though we 
cannot assign a specific number until the assembly is 
complete. The subfamilies that have clear representa- 
tives in all four of the surveyed species are ROP21/27 
and ROP35. 

In S. neurona, rhoptries are present in the sporozoite 
[39] and bradyzoite [40] stages but absent from sch- 
izonts and merozoites [41]. Surprisingly, we found S. neu- 
rona genomic regions and expressed sequence tags (ESTs) 



from the schizont and merozoite stages that appear to 
code for rhoptry kinases. Of the ESTs currently avail- 
able in the NCBI GenBank EST database, we identified 
seven putative rhoptry kinases [GenBank:BM303 139.1, 
BM303688.1, BQ749596.1, BQ750005.1, BU085181.1, 
CO748650.1, CV193082.1], all obtained from the S. neu- 
rona merozoite stage, evidence that these genes are indeed 
expressed despite the absence of rhoptry organelles during 
this life stage. We also examined genomic open reading 
frames (ORFs) for signal peptides using the program Sig- 
nalP [42] and identified likely signal peptide regions and 
cleavage sites in several of the ORFs that we predicted to 
encode rhoptry kinases, suggesting that at least some of 
these are likely to be exported. 

Both pseudokinases and catalytically active kinases 
appear to be prevalent throughout the ROPK family, in 
roughly equal numbers of subfamilies. The pseudokinase 
subfamilies are distributed throughout the phylogenetic 
tree, rather than forming any distinct clade, suggesting 
that the evolutionary pressures that lead to the degenera- 
tion of paralogs into pseudokinases have applied through- 
out the ROPK family. 

Phylogenetic clustering reveals distinct sub-clades 

We inferred a phylogenetic tree from the consensus 
sequences of each of the ROPK subfamilies to illustrate 
evolutionary patterns within the ROPK family (Figure 1). 
Several distinct clades emerged, which we examined more 
specifically: rhoptry kinases with homology to the N- 
terminal extension (NTE) observed in ROP2, ROP8 and 
ROP5 structures (discussed below); an expanded clade of 
seven subfamilies specific to E. tenella; and a basal clade 
of divergent, ROPK-like protein kinases, including ROP35 
and BPK1, which we refer to as ROPKL here. 

Within the E. tenella-specific clade, the putative 
ROPK proteins ETH_00028855, ETH_00020620 and 
ETH_00000075, which we placed in the subfamilies 
Eten2b, Eten3 and Eten4, respectively, were recently 
observed to be expressed solely in merozoites [38]. The 
emergence of this gene clade reflects the significant phylo- 
genetic and phenotypic divergence of the oocyst-forming 
E. tenella from the other tissue-cyst-forming coccidian 
species we have examined here [43]. E. tenella also con- 
tains several putative ROPKs outside this clade, more 
closely related to the ROPKs found in T. gondii and N. can- 
inum, which we placed in the ROPK-Unique category 
(Additional file 1). 

The previously identified proteins in the ROPKL clade 
are ROP33, ROP34, ROP35 and ROP46. The clade also 
contains the brazyzoite-expressed pseudokinase BPK1 
[44]. The gene models of the ROPKL proteins in T. gondii 
ME49, the best-annotated strain, all contain at least one 
intron, in contrast to most other ROPK genes, which are 
typically encoded by a single exon. 
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Figure 1 Phylogeny of rhoptry kinase subfamilies. Predicted or known active kinases are labeled in bold text, and kinases that may have a 
noncanonical catalytic mechanism are marked with an asterisk. Newly proposed ROPK subfamilies are labeled in italic text. The clade indicated in 
red contains the ROPK subfamilies with a homologous N-terminal extension to the kinase domain (NTE). The clade in green is specific to £ tenello. 
The divergent "ROPKL" clade is shown in blue. Branch labels indicate bootstrap support. The grid along the right side indicates the species in which 
each subfamily appears: T. gondii (Tg), N. caninum (Nc), S. neurona (Sn) and £ tenella (Et). 



Known or likely catalytic kinases 

In our analysis, we consider the catalytically essential 
residues to be the aspartate in the catalytic loop ("HRD" 
motif, D166 PKA ) and the aspartate in the Mg-binding 
loop at the start of the activation segment ("DFG" motif, 
D184 PKA ); we categorize the ROPK subfamilies missing 
either of these residues as pseudokinases. Additionally 
important residues involved in ATP positioning or confor- 
mational changes necessary for catalytic activity include 
a glycine in subdomain I (G52 PKA ), lysine in subdomain 
II ("VAIK" motif, K72 PKA ), glutamate in subdomain III 
(E91 PKA ) and asparagine in the catalytic loop (N171 PKA ) 
[29,45,46], as well as the F-helix aspartate which positions 
the catalytic loop ("DxW" motif, D220 PKA ) [47]. While 
catalysis has been observed in kinases that lack one or 
more of these residues, their absence usually indicates 



a noncanonical mechanism or impairment of activity 
[31,48,49]. 

The subfamilies ROP11, ROP16, ROP17, ROP18, 
ROP19/29/38, ROP20, ROP21/27, ROP25, ROP28, 
ROP30, ROP31, ROP32, ROP35, ROP39 and ROP41 were 
previously suggested to be active kinases based on the 
conserved catalytic triad [10]. Phosphoryl transfer has 
been demonstrated experimentally for ROP18 [50] and 
ROP16 [24], and molecular modelling simulations have 
shown that ATP could dock in a typical conformation to 
ROP11, ROP16, ROP17 and ROP18 [36]. Our analysis 
additionally found the catalytically essential residues con- 
served in ROP33, ROP34 and ROP46, suggesting these 
may also be active kinases. Of the E. tenella-specific sub- 
families we identified, ROPK-Etenl also retains all of the 
key residues needed for catalysis (Figure 2). 
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Figure 2 Conserved motifs of catalytically active rhoptry kinase subfamilies. Sequence logos of key regions in the kinase domain of the 
broader ePK superfamily and of predicted active ROPK subfamilies as they occur in the coccidian species examined. Letter height at each sequence 
position indicates greater conservation of that character in a multiple sequence alignment of a large set of ePK sequences (first row) and the 
annoted genomic sequences of each ROPK subfamily. Row labels indicate subfamily names, with the number of sequences in each alignment 
shown in parentheses. The ePK-conserved motifs shown are the glycine-rich loop in subdomain I, catalytic lysine in subdomain II, aC glutamate in 
subdomain III, catalytic loop in subdomain VIb, "DFG" in subdomain VII, "APE" in subdomain VIII, aV "DxW" in subdomain IX, and arginine in 
subdomain XI. The adjacent sequence sites surrounding each motif are included for context. Asterisks indicate above ePK motifs indicate the 
catalytic triad. Generated using the WebLogo [82] and ReportLab [83] libraries. 



Known or likely pseudokinases 

Kinases that lack one or more of the residues nec- 
essary for catalysis are likely to be non-catalytic 
pseudokinases. The apparent pseudokinase ROPK sub- 
families are ROP2/8, ROP4/7, ROP5, ROP22, ROP23, 
ROP26, ROP36, ROP37, ROP40 and ROP42/43/44, as 
identified previously [10]. We include BPK1, previously 
noted as a I gondii brazyzoite-expressed pseudokinase 
[44], in the ROPK family based on sequence similarity. 
Additionally, our proposed subfamilies ROP47, ROP49, 



ROP50, and the E. ten ella- specific ROPK-Eten4, ROPK- 
Eten5 and ROPK-Eten6, are also missing key aspartates 
involved in the kinase catalytic mechanism and are likely 
to be pseudokinases (Figure 3). ROP50 does have an aspar- 
tate at the HRD+3 position (Figure 3), so in absence of a 
structure we cannot rule out that this nearby residue may 
play a compensatory role in catalysis. 

Several of these pseudokinase subfamilies share the 
unusual characteristic of replacing the catalytic aspar- 
tate (in the kinase-conserved "HRD" motif) with a basic 
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Figure 3 Conserved motifs of likely inactive rhoptry kinase subfamilies. Sequence logos of conserved motif regions in the kinase domain of 
the broader ePK superfamily and of predicted pseudokinase ROPK subfamilies as they occur in the coccidian species examined. 



residue: ROP4/7 (HGK), ROP5 (HG[R/K/H]), ROP22 
(HTH), ROP36 (HGH), ROP40 (LRR) and ROP42-43-44 
(HGK), as previously noted [37]. 

Noncanonical kinases 

The subfamilies ROP24, ROP45 and the proposed ROP48, 
ROPK-Eten2a and ROPK-Eten2b have most of the 
residues necessary for catalysis, but with some differ- 
ences in other typically conserved residues that suggest 
the mechanisms may be noncanonical (Figure 4). 

In most active ePKs, an asparagine in the catalytic 
loop (N171 PKA ) coordinates a magnesium ion to posi- 
tion ATP in the active site [29]. This residue varies 
among some ROPKs: In ROP24, ROP45 and ROP48, the 
asparagine is replaced by a basic residue (lysine, histidine 
and lysine, respectively). The closely related E. tenella- 
specific subfamilies ROPK-Eten2a and ROPK-Eten2b have 



the catalytic loop motifs HNDLKLDG and HNDLKLSS, 
respectively, each replacing the ePK-conserved asparagine 
with a different residue type. Such replacements are rare 
in catalytically active kinases; in an alignment of ePK 
sequences (not shown), we observed only two cases in 
which the "HRD" motif is conserved without the acco- 
manying asparagine, both of which have been shown 
to have noncanonical catalytic mechanisms: CASK [49], 
which replaces the asparagine with a cysteine, and Type II 
PAK [51], which has a serine. 

The ePK-conserved lysine in subdomain II (/?3) is 
replaced with arginine in ROP45, ROPK-Eten2a and 
ROPK-Eten2b, though the conserved C-helix glutamate 
is retained, suggesting the necessary salt bridge could 
still form in the active state of these kinase as in other 
ePKs. In ROP24, however, the lysine is retained but 
the corresponding C-helix glutamate is instead alanine, 
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Figure 4 Conserved motifs of ROPK subfamilies with potentially noncanonical catalytic mechanisms. Sequence logos of conserved motif 
regions in the kinase domain of the broader ePK superfamily and of ROPK subfamilies with predicted noncanonical catalytic mechanisms as they 
occur in the coccidian species examined. 



precluding a salt bridge. The DFG motif is replaced 
with the sequence GFT, though a potentially compen- 
satory acidic residue appears at the DFG+1 position. 
These observations suggest that the activation mech- 
anism [52,53] in ROP24 could be different from that 
of other ePKs. ROP48 retains the /33 lysine, aC glu- 
tamate and DFG motif; however, the substrate-binding 
lobe is quite divergent, with a dramatically shortened 
activation loop and F-helix, and the F-helix DxW motif 
is replaced with ESS, which suggests that the position- 
ing of the catalytic loop occurs differently from other 
ePKs. 

The E. tenella-specific subfamily ROPK-Eten3, in con- 
trast to all the other identified ROPK subfamilies, 
appears to comprise both active and inactive kinases. 
The locus appears as a tandem repeat of 5 simi- 
lar genes, with pairwise identity ranging from 32% to 
52% (mean 41%), only one of which (ETH_00020585) 
retains the key residues indicating catalytic function 
(Figure 4). 

ROPK-conserved inserts within the protein kinase domain 

ROPK- and subfamily-specific inserts within the kinase 
domain are widespread, suggesting unique functional 
adaptations [36,37,50]. We found six conserved inserts in 
the ROPK domain relative to the PK domain (Figure 5). 
They are: 

(i) An extension of the /33-aC loop, residues 
289-293 ROP2 , of varying length across ROPK 
subfamilies; it is fairly short (4-5 amino acids) in the 
NTE-bearing clade, missing altogether in ROPKL, 
but extends up to 13 amino acids other ROPKs 
including the E. ten elh -specific clade. 



(ii) C-terminal to the aC helix, residues 309-318 ROP2 , 
present in all subfamilies except the ROPKL clade in 
roughly equal size. In the ROP2/8 structures 
[PDB:2WlZ,3DZO,3BYV] it was observed to form an 
additional helix, termed aC [36], in the kinase 
inter-lobe hinge area (discussed below), while in the 
ROP5 structures [PDB:3Q5F,3Q60] it is disordered. 

(iii) In p4-p5 loop, residues 335-351 ROP2 , present in 
most subfamilies, including ROP33 but not the other 
ROPKLs, in similar size. In a ROP2 structure 
[PDB:2W1Z] this appears as two strands, termed 
ft and /3", that extend the loop to form a /3 -hairpin in 
the kinase N-lobe [36], spatially near the a" helix of 
the NTE. In the other structure of ROP2, ROP8 and 
ROP5 [PDB:3DZO,3BYV,3Q5F,3Q60] this region is 
mostly disordered, though the protein sequences 
indicate the insert is present in this subfamily 

as well. 

(iv) Between the kinase APE motif (end of the activation 
segment) and the a¥ helix, residues 453-462 ROP2 , 
present in varying lengths across the ROPK 
subfamilies including each of the major clades (NTE, 
Eten, ROPKL). This is near the substrate-binding site 
in typical protein kinases. The insert appears as a 
short 4aa loop in ROP5 [PDB:3Q60], but in ROP2 
[PDB:3DZO] and ROP8 [PDB:3BYV] it forms an 
additional single-turn helix in crystal structures 
[PDB:3DZO, PDB:3BYV] [50], though this feature 
may have been stabilized in the crystals because of 
crystal packing. 

(v) An extension of the aF-aG loop, absent from 
ROP2/8, ROP40 and ROP49 and the ROPKL clade, 
but present in ROP5 and the other ROPK subfamilies 
in the region of residues 467-478 ROP5 . In the ROP5 
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R0P2 
(PDB:2W1Z) 




Figure 5 Structural location of ROPK-conserved inserts. Inserts 
relative to the conserved ePK fold are highlighted in red. The 
N-terminal extension (NTE) and C-terminal extended helix (a HO are 
shown in blue. Novel secondary structures are labeled according to 
Labesse et al. [36]. 



structures [PDB:3Q5F,3Q60], B-factors indicate this 
elongation of the aF-aG loop is relatively flexible 
compared to the adjacent regions; the G -helix itself 
appears unfolded. Sequences of other ROPKs, 
including ROP24, suggest it is even longer in those 
subfamilies. 

(vi) In the aG-aH loop, near the C-terminus of the aG 
helix, a 5aa insert absent from ROP2/8, ROP5, 
ROP18, ROP23, ROP25, ROP26, ROP30 and ROP40 
and the ROPKLs but present in the other ROPK 
subfamilies including the E. ten elh -specific clade. 
The ROPKLs appear to have large deletions in this 
region, and may be missing the aG helix structure 
altogether. We note that the aG-aH loop is 
extended in many other protein kinases, most 
notably CMGC kinases [54]. 

Distinguishing ROPK-specific conserved sites in the protein 
kinase domain, and corresponding structural features 

We evaluated shifts in site-specific residue conservation 
between the ROPK family and overall PK superfamily by 
performing a goodness-of-fit test of residue frequences in 
the two sequence sets at each aligned column of the PK 
domain (see Methods). The same comparisons were also 
performed with each subfamily versus the other ROPKs 
(Additional file 2). 



Hinge region 

The most statistically significant sites distinguishing 
ROPKs from PKs overall are in the kinase hinge region. 
Numbered according to ROP2 [PDB:2W1Z], these are: 
sites [E/R/Y]320, L321, [R/G]322, [V/L/A]325 and P326 in 
the aC-p^ loop; P358 in the p5-aD loop, and [L/F/Y]424 
in the /38 strand (Figure 6). Two residues in the aE 
helix, [L/A/S]396 and [H/N/S]399, are oriented toward 
the hinge region and under the aC' helix. 

The residue P358 ROP2 is typically a glutamate in most 
eukaryotic protein kinases (e.g. E121 PKA , E565 FGFR ), 
where it contributes to the opening/closing motion of the 
kinase during activation by forming a lobe-bridging salt 
bridge interaction [55]. In fibroblast growth factor recep- 
tor kinase (FGFR), for example, the equivalent residue 
E565 hydrogen-bonds with K641 in the ^8 strand con- 
ditionally upon phosphorylation of the FGFR activation 
loop [56] (Figure 6D,E). In ROP2, the residues equiva- 
lent to E565 and K641 are P358 and F424, respectively 
(Figure 6A,B). Since proline and phenylalanine are not 
charged residues, the ROP2 structure is incapable of 
forming the same interaction. The residue P358 ROP2 is 
conserved as a proline throughout most of the ROPK 
family, with the exception of subfamilies ROP18 (methio- 
nine), ROP21/27 (aspartate, though a Phe appears in the 
£8 strand), ROP26 (serine), ROP32 (histidine), ROP41 
(lysine), and the E. tenella-specific subfamilies (retained 
as glutamate, though only ROPK-Etenl also retains a basic 
residue in the /38 strand) (Additional file 2). 

The residues at sites P358 ROP2 and P326 ROP2 appear to 
have instead taken on another structural role. In ROPKs, 
the residue immediately N-terminal to P358 ROP2 , a site 
known as the kinase "gatekeeper" residue, is a large, 
usually hydrophobic residue oriented toward the aC'-/34 
strand and, in the ROP2 structure, packing against the 
ROPK-conserved P326; the hydrophobic residue imme- 
diately N-terminal to P326 (most commonly valine but 
also varyingly leucine, alanine, phenylalanine, isoleucine 
and methionine in ROPKs) is likewise oriented toward 
the linker in the ROP2 structure, packing against P358 
(Figure 6C). These four residues thus form a stable pack- 
ing "box" bridging the a?C-^4 and @5-aD loops. 

F-helix "WC" motif and disulfide bridge 

A distinctive "WC" motif appears at the end of the 
aF helix (Figure 7) in most ROPKs. The cysteine 
(C485 ROP2 ), together with another ROPK-conserved cys- 
teine (C506 ROP2 ) [9] in the aG-aH insert described 
above, forms a disulfide bond which has been proposed to 
stabilize the two helices [50]. The tryptophan (W484 ROP2 ) 
appears to pack against the extended aD and aE helices, 
pushing the aE helix futher outward. Thus the "WC" 
motif couples two ROPK-specific inserts to the substrate- 
binding lobe of the kinase core. There are no other known 
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Figure 6 Contrasting sites between ROPK and PK: kinase hinge region. (A) Sequence logos of regions surrounding the aC helix insert and 
kinase hinge in ROPK (top) and PK (bottom), with selected contrasting sites highlighted. (B) R0P2 structure with selected contrasting sites shown in 
"sticks" representation. (C) Inset of the R0P2 structure around the contrasting prolines in the aC-fi4 loop and linker. Two other adjacent residues, 
1327 and Y357, pack against the prolines on opposite sides. (D) and (E) Two structures of the protein kinase FGFR show the conditional salt bridge 
between the linker glutamate (E565) and /38 lysine (K641 ), commonly observed in typical protein kinases, dependent upon kinase activation. 



protein kinase families or subfamilies in which cysteines 
at the end of the F-helix and in the aG-aH loop co-occur 
in positions that could potentially interact. Additionally, 
both the WC motif and the aG-aH cysteine are absent 
from the E. tenella and ROPKL clades. 

Another site in the a¥ helix (W482 ROP2 ) is conserved 
as a glutamate in most ePKs (E230 PKA ), but unconserved 



in ROPKs, suggesting that a selective constraint that con- 
serves glutamate at this site in most ePKs has been lost in 
the ROPK family. In at least some other ePKs, it appears 
that this glutamate can interact with a basic residue on 
the polar/charged surface of the amphipathic aD helix 
(R133 PKA ), as well as a conserved tyrosine in the P+l 
pocket (Y204 PKA ) at the end of the activation segment 
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Figure 7 Contrasting sites between ROPK and PK: C-lobe WC motif and loss of Glu constraint. (A) and (B) Structures of R0P2 and R0P5 with 
WC motif and ROPK-conserved disulfide bridge residues shown in "sticks" representation. (C) Sequence logos of F helix region in ROPK (top) and 
ePK (bottom), with contrasting sites highlighted. (D) PKA, a representative typical protein kinase, with equivalent residues shown as sticks. (E) CDK2, 
another typical protein kinase. The "CMGC insert" occurs in the aG-aY\ loop but does not perform the same structural role as the ROPK-specific 
insert in the same region. 



(Figure 7D,E). Notably, the mutation of E230 to glutamine 
in PKA not only disrupted substrate recognition and 
phosphoryl transfer, but also resulted in higher temper- 
ature factors in the aD helix, particularly in R133 [57]. 
However, in ROPKs the interaction between the F and D 
helices occurs somewhat differently: in ROP5, R455 inter- 
acts with E345 and Y427, and in ROP2, W482 packs with 
H365, while the P+l -pocket Tyr replaced by F446, a side 
chain not capable of hydrogen bonding (Figure 7A,B). 

N-terminal extension to the protein kinase domain 

Structural studies of ROP2, ROP8 and ROP5 revealed 
another feature common to each of these proteins, 
an N-terminal extension (NTE) to the canonical pro- 
tein kinase domain consisting of at least two addi- 
tional helices and a beta sheet, with the region between 
the two helices varying between ROP2/8 and ROP5 
[36,37,50]. The NTE has also been suggested to be 
present in ROP18, ROP4/7 and ROP17 based on sequence 
homology, though its presence does not appear to be 



universal among rhoptry kinases [37,50]. We inves- 
tigated the distinguishing features of NTE-containing 
rhoptry kinases to determine whether other ROPKs 
may also contain the NTE, and to look for additional 
conserved features that characterize this gene clade 
(see Methods). 

In addition to ROP2/8 and ROP5, we found signifi- 
cant matches in ROP4/7, ROP17 and ROP18, as expected, 
and also a number of additional subfamilies which appear 
to form a clade (Figure 1): ROP23, ROP24 (originally 
known as ROP2L8 [6]), ROP31, ROP40, ROP42/43/44, 
and the proposed ROP47. Four proteins in the ROPK- 
Unique (species-specific) category also showed evidence 
for NTE homology: TGME49_296000 (TGME49_096000 
in ToxoDB prior to version 8.0), also known as ROP2L12 
and previously identified as a pseudogene [6]; its orthologs 
TGVEG_050080 and TGGT1_054010; and the E. tenella 
protein ETH_00005190. A small number of sites in 
the NTE sequence region show strong conservation 
(Figure 8). 



Talevich and Kannan BMC Evolutionary Biology 201 3, 13:1 1 7 
http://www.biomedcentral.eom/1 471 -21 48/1 3/1 1 7 



Page 11 of 17 



a' a" BO 




Contribution 



Figure 8 HMM sequence logo of the NTE region. Conserved secondary structures are indicated above the corresponding sequence positions. 
Generated with the HMM-Logos server LogoMat-M [58]. 



Having identified the NTE-bearing clade, we then com- 
pared this clade to all other identifid ROPKs to identify 
clade-specific residue conservation patterns. In the solved 
structures of ROP2, ROP8 and ROP5, several of these dis- 
tinctive sites in the NTE clade are spatially located around 
the NTE itself, primarily near the conserved /30 and a sec- 
ondary structure elements. In ROP2, V330 and P333 in 
the /?4 sheet ^4-^4' loop are positioned on either side of 
the fiO sheet of the NTE, close to the conserved S244; in 
ROP5, the equivalent residues are V310 and Q313. In each 
of the solved crystal structures of ROP2 [PDB:2W1Z], 
ROP8 [PDB:3BYV] and ROP5 [PDB:3Q60], the £0 sheet 
passes directly between these two side chains, suggesting 
a structural selective constraint in NTE-bearing ROPKs. 

Three significantly contrasting sites in the E-helix 
may also have some bearing on the NTE conformation 
or placement: H378 near the aE N-terminus, oriented 
toward the NTE in the ROP2 structure [PDB:2W1Z]; 
V382, a small, nonpolar residue oriented toward the 
extended aD; and Q388 in the middle of the aE helix, 
where in the ROP2 structure it interacts with the back- 
bone of the conserved G198 at the N-terminus of the NTE 
a — though in the ROP5 structure the equivalent residue 
is 1368 which despite having the same orientation cannot 
form an identical interaction. 

Also in the aE helix, a hydrophobic residue (L391 ROP2 , 
A371 ROP5 ), in place of a usually basic residue outside 
the NTE clade, is oriented toward a helix which extends 
beyond the kinase C-terminus in the ROP2, ROP8 and 
ROP5 structures, previously described as the aH' helix 
[36]. Though this short, weakly conserved region is dif- 
ficult to detect by sequence analysis, the conservation of 
the hydrophobic residue in the aE helix and the presence 
of this helix in the available structures does suggest a cor- 
relation between the presence of the NTE and C-terminal 
aW helix. 

Discussion 

We classified the ROPKs into likely active kinases, likely 
pseudokinases, and predicted kinases that may be active, 
but with a noncanonical catalytic mechanism, based on 
differences in ePK-conserved residues surrounding the 



ATP binding pocket. Our alignment shows that con- 
served residues in or near the key ePK-conserved motifs, 
including the histidine of the canonical "HRD" motifs, 
are well aligned for each of these categories, so it is 
unlikely that the absence of the key aspartates in pre- 
dicted pseudokinases is due to misalignment. Structural 
investigation of the unusual motifs in noncanonical sub- 
families ROP24 and ROP45 in T. gondii could reveal novel 
kinase mechanisms of activation, ATP positioning and 
catalysis. Relatedly, analysis of the equivalent motifs in the 
ROPK pseudokinases could improve our understanding of 
pseudokinases in general. 

Our phylogenetic tree of ROPK subfamilies revealed 
three specific clades of interest: the NTE-bearing ROPKs, 
the only clade for which crystal structures have been 
solved or even homology models reliably constructed; an 
E. tenella-specific expansion of ROPKs; and the divergent, 
intron-bearing ROPKLs. Notably, each of these clades 
contains both predicted active kinases and pseudokinases, 
indicating a pattern of evolution in which, in a parsi- 
monious interpretation, pseudokinases repeatedly emerge 
from an ancestral state shared with active kinases, rather 
than a single or small number of expansions of pseudoki- 
nases. 

We were unable to find conclusive published evidence 
that the ROPKL proteins are indeed localized to the rhop- 
try during the tachyzoite stage of coccidians and expelled 
during invasion at the same time and through the same 
mechanism as other ROPKs. ROP35 protein expression 
has been detected during the T. gondii tachyzoite stage 
[59] and the E. tenella merozoite stage (ETH_00005905) 
[38]. Signal peptides were predicted for ROP33, ROP50 
and BPK1, but not ROP35, while the gene models of 
ROP34 and ROP46 contain a short or nonexistent N-tail 
to the kinase domain which could indicate a trunctated 
gene model. However, transcription levels across the cell 
cycle do not match the distinctive two-peaked pattern of 
T. gondii rhoptry proteins in any of the T. gondii ROPKLs 
[60]; the secretory organelle of BPK1 was not identified in 
the study that described the protein [44]. Our HMM pro- 
file search and gene trees indicated that the ROPKL pro- 
teins show stronger sequence similarity to typical ROPKs 
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than to any other characterized protein kinase family, 
leaving open the question of how deep their functional 
similarity goes. 

A common theme we observe in structural fea- 
tures unique to the ROPK family is the interaction 
between ROPK-specific inserts or structural motifs, 
including the N-terminal extension (NTE), and con- 
served sites within the kinase domain that show 
contrasting selection in ROPKs. Two regions in partic- 
ular, the kinase hinge region surrounding the aC helix 
and and the dusulphide bridge at the end of the aE 
helix, suggest several possible functional or mechanistic 
consequences. 

Our observations in the ROPK hinge region raise several 
hypotheses. The aC insert in the aC-/34 loop has pos- 
sible structural analogues in other kinases. The vaccinia- 
related kinase (VRK) family has a similar insert which 
packs hydrophobically against the aE helix and was pro- 
posed to promote a closed conformation of the kinase 
domain in the pseudokinase VRK3 [61]; the authors of 
that study suggested that related active kinases that retain 
the same feature would be constitutively active. Com- 
parison of the structure of VRK3 [PDB:2JII] to that of 
ROP2 [PDB:2WIZ] indicates that the ROPK-conserved 
site L396 ROP2 (Figure 6A,B) may perform a similar role 
to the VRK3-conserved F296 VRK3 in hydrophobically cou- 
pling the two lobes of the kinase domain. Interestingly, 
the ATP-bound and apo structures of the pseudokinase 
ROP5 show very little overall conformational change [37]. 
As another example, crystal structures of the yeast SRPK 
protein Skyl conserve a short aC helix insert, and the 
flexibility of this region is indicated to be critical for 
interlobe closure [62]. Together with the ROPK-specific 
conservation of prolines in the a C-/34 loop and linker, this 
could indicate the possibility that these differences mod- 
ulate interlobe closure (the kinase hinging mechanism) in 
ROPKs. 

Another hypothesis regarding the function of the aC 
helix, not necessarily conflicting with the above hypothe- 
sis, is that it could serve as a binding interface or protein- 
protein interaction site. We observed that the aC helix 
does not pack hydrophobically against the N-lobe of the 
kinase domain in the available ROP2 structures; instead, 
there appear to be water molecules in between [PDB: 
3Q60] [37]. The B-factors are somewhat higher than in 
the immediately surrounding areas, and the symmetry 
of the ROP2 structure suggests that the insert may have 
been stabilized in this structure by crystal packing. Given 
that the same region is disordered in the available ROP5 
structures, it appears possible that aC f may be relatively 
flexible, capable of unfolding from the helical secondary 
structure into a mobile loop. For comparison, in VRK3, 
a surface patch centered on the aC-aC region has been 
proposed as a binding site [61]. 



In the kinase C-lobe, a pair of ROPK-conserved cys- 
teines form a disulfide bridge between the end of the 
aE helix and the aG-aH loop, which is extended in 
most ROPKs. A conserved tryptophan adjacent to the 
aE cysteine packs hydrophobically against the aD and 
aE helices, which are also extended in ROPKs; thus the 
"WC" motif appears to couple both ROPK inserts to the 
kinase C-lobe. Notably, this stabilization occurs in the 
surface region of the protein that was identified as poly- 
morphic between ROP5 alleles in T. gondii [37], and was 
recently shown to be the interface of an interaction with 
the host (mouse) immunity-related GTPase (IRG) protein 
[19]. Reese et al. proposed an allosteric network involv- 
ing the NTE and aE helix to link the polymorphic surfaces 
in the C-lobe and kinase active site in ROP5 [37]. The 
variability of this site in ROPKs may therefore be justified 
by its involvement in that network, which itself appears 
to be variable in ROPKs. We can hypothesize that, at 
least in ROP5, the increased structural stability provided 
by the WC motif in this region permits these subfamily- 
specific mutations to proliferate at this surface without 
compromising the folding or stability of the kinase C-lobe 
[63]. This hypothesis assumes that the disulfide bridge is 
indeed maintained throughout the lifespan of the protein; 
while it appears as such in the available solved structures, 
we note that once the protein is inside the host cell, the 
cytosolic environment is not conducive to disulfide bond 
formation. The two cysteines involved are co-conserved 
in not only the PVM-associated ROP2, ROP8, ROP5 and 
ROP18, but also ROP16, which has been shown to be 
localized to the host nucleus [22], among other ROPKs. 

We also searched for sites that showed conservation 
specific to the NTE-bearing ROPK clade, rather than 
ROPKs as a whole. Interestingly, only a small number of 
strongly contrasting sites emerged as specific to this clade. 
This could indicate that the mechanistic roles of the NTE 
vary across even the NTE-bearing clade of ROPKs. 

More structural information will be essential to further 
understand the ROPK family. Currently, only ROPKs from 
the ROP2/8 and ROP5 subfamilies within the NTE clade 
have been solved [36,37,50]. While these structures have 
been invaluable in understanding ROPK mechanisms and 
possible functions, the low sequence identity and pres- 
ence of indels across subfamilies makes it difficult to 
produce reliable homology models for ROPK subfami- 
lies outside this clade. We can suggest several important 
ROPKs outside the NTE clade which appear to be active 
kinases, are highly expressed [10], and from which we 
could gain important insights from the solved crystal 
structure. ROP16 was indirectly implicated in virulence 
differences between T. gondii strains in mice [15], and also 
shown to to modulate the host STAT3 and STAT6 path- 
way response [22-26], but the precise mechanisms of this 
action remain to be discovered. Peixoto et al [10] found 
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evidence that ROP38 is involved in modulating the MAPK 
cascade; the ROP19/29/38 subfamily was also found to 
be independently duplicated in T. gondii and N. caninum, 
thus the other subfamily members could easily be mod- 
eled if a ROP38 structure were available. Finally, ROP35 is 
a representative member of the divergent, poorly under- 
stood ROPKL clade; the presence of several indels relative 
to other ROPKs at structurally important locations in the 
sequence suggest that a crystal structure would almost 
certainly reveal surprising variations on the ePK fold and 
catalytic mechanisms. 

Conclusion 

In this study, we developed novel bioinformatic methods 
to study patterns of diversification and neofunctional- 
ization in the rhoptry kinase family, and integrated the 
results of a systematic, multi-species analysis with the 
structural context provided by the solved structures. Our 
phylogenetic analysis revealed a subfamily-level structure 
shared across species, as well as lineage-specific expan- 
sions within the ROPK family and three distinct sub- 
clades of ROPK. We applied general knowledge of protein 
kinase mechanisms to categorize each rhoptry kinase as 
a likely active, likely pseudokinase, or potentially active 
but with an atypical catalytic mechanism. We determined 
the sequence and structural features that distinguish these 
subfamilies from each other, as well as those that dis- 
tinguish the ROPK family as a whole from typical ePKs. 
Where possible, ROPK-specific motifs were placed into 
structural context to develop functional hypotheses. 

This work sheds light on several important but pre- 
viously unrecognized features shared among rhoptry 
kinases, as well as the essential differences between active 
and degenerate protein kinases or pseudokinases. Our 
studies provide specific hypothesis for further character- 
izing ROPK structure and function and also inform ongo- 
ing efforts to design protein kinase inhibitors for global 
diseases caused by coccidian parasites. 

Methods 

Data collection 

The sequences of translated gene models, unannotated 
genomes and ESTs from the species Toxoplasma gondii, 
Neospora caninum, Eimeria tenella were retrieved from 
ToxoDB version 8.1 [64]. Pre-release genomic sequences 
and ESTs of Sarcocystis neurona were provided by the lab- 
oratories of Dan Howe, Christopher Schardl and Jessica 
Kissinger. 

After constructing the initial ROPK subfamily profiles 
(below), additional ROPK sequences were identified in 
the NCBI databases est_others and nr and added to 
the profiles. To obtain putative ROPK sequences from 
the unannotated T. gondii and S. neurona genomes, we 
used the program exonerate (https://www.ebi.ac.uk/- 



guy/ exonerate/; also see [65]) to align the ROPK subfamily 
consensus sequences to each genome scaffold sequence, 
omitting introns according to likely splice sites. A script 
using Biopython [66] was then used to extract the highest- 
scoring putative protein sequences from the exonerate 
output and combine identical sequences and sequence 
fragments. 

Subfamily classification 

We previously constructed a database of HMM profiles 
for every known protein kinase family and subfamily 
defined in KinBase [67], as well as several apicomplexan- 
specific kinase families [11]. The ROPK profile in this set 
was initially constructed from annotated ROPK sequences 
in ToxoDB, similar to the techinique described by Peixoto 
et al. [10]. Sequences were aligned using MAFFT ver- 
sion 6.940 [68] with a "seed" alignment of the protein 
kinase domain constructed using published PDB struc- 
tures [PDB: 2W1Z, 3BYV, 3DZO, 3Q5Z, 3Q60] [36,37,50] 
and the structure alignment program TM-align (May 2012 
release) [69]. Finally, HMM profiles were constructed 
from each sequence alignment and compiled into an 
HMM profile database (Additional file 3). We used this 
HMM profile database to search the protein and trans- 
lated EST sequences described in the previous section; 
those which scored as stronger matches to the ROPK- 
specific HMM profile than to our ePK profiles were taken 
as an initial set of putative rhoptry kinases. 

We developed a program called Fammer to partially 
automate the construction and curation of hierarchical 
protein subfamily sequence profiles for use with HMMer 
3.0 [70] and MAPGAPS 1.0 [71], and to use these HMM 
and MAPGAPS profiles for sequence search, classification 
and alignment. The Fammer software package, including 
source code, documentation and the ROPK sequence pro- 
files used in this study, is available at http://github.com/ 
etal/fammer. 

The full-length ROPK sequences identified in each 
annotated coccidian genome and translated EST set were 
clustered using OrthoMCL version 2.0.3 [72]. We man- 
ually trimmed the sequences in each OrthoMCL cluster 
to the canonical protein kinase domain and aligned the 
sequence sets with Fammer version 0.1.0 to create an ini- 
tial set of ROPK subfamily profiles, as well as a set of 
"unique" or orphan ROPKs which matched the ROPK 
HMM profile but were not placed into a larger cluster by 
OrthoMCL. 

Iteratively, we performed the following steps to refine 
the ROPK subfamily classification. We constructed a 
phylogenetic tree of the consensus sequences of each 
putative ROPK subfamily, using FastTree version 2.1.5 
[73], and merged ortholog groups which were sepa- 
rated by short branches in the tree and, for subfamilies 
that appeared in multiple copies within a single genome 
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(e.g. ROP2/8, ROPK-Eten3), showed co-localization in 
the chromosome. Existing descriptions of the annotated 
T. gondii proteins were used to assign names to subfam- 
ilies. Unannotated subfamilies that were phylogenetically 
placed basally to the known ROPKs, indicating closer 
relationship to other ePKs, were removed. We visually 
inspected each subfamily sequence set for potential out- 
lier sequences, on the basis of conserved motifs in key 
regions of the kinase domain, and moved any of these to 
the "unique" sequence set. We used the Fammer build 
command to realign all sequences and to construct an 
HMM profile database of all subfamily profiles, then used 
this database with the Fammer scan command to reclas- 
sify the "unique" or outlier ROPK sequences. We included 
a profile of non-ROPK protein kinase sequences in this 
HMM database in order to identify and remove false pos- 
itives in the "unique" set as well as subsequent searches 
of the coccidian proteome, genome and EST sequences. 
Finally, we used the Fammer refine command to per- 
form leave-one-out validation of each subfamily profile 
versus the "unique" sequence set, following the approach 
described by Hedlund et al. [74]. This process yielded 
42 stable subfamilies of ROPK, along with a "ROPK- 
Unique" profile set of unclassified orphan sequences. We 
then identified the ROPK complement in each annotated 
proteome by running the Fammer scan command with 
the final ROPK HMM profile database, each coccidian 
species' proteome sequences, and an expectation-value 
cutoff of 10- 10 . 

Subfamily tree inference 

We used the curated alignment of consensus sequences 
from each ROPK subfamily profile and the non-ROPK 
protein kinase profile as input to infer phylogenic trees. 
To quickly examine the structure of the ROPK family dur- 
ing profile refinement, we used FastTree [73] with the 
WAG scoring matrix, gamma model of rate variation and 
pseudocount correction for gaps. To infer the final tree 
shown in Figure 1, we first used the GUIDANCE server 
[75] with 100 replicates of PRANK and removed columns 
with less than 5% support, in order to remove align- 
ment columns that were likely to have been misaligned 
while retaining most of the potentially phylogenetically 
informative columns. We then used a script to remove 
columns that were more than 30% gap characters. This fil- 
tering yielded an alignment of 279 columns, slightly less 
than the length of the top-level ROPK HMM profile (288 
columns). We inferred the tree from this alignment using 
PhyML (December 2011 release) [76], with the LG scor- 
ing matrix, gamma model of rate variation, empirically 
estimated amino acid frequencies and 100 bootstrap runs, 
taking the output of FastTree as the user-supplied start- 
ing tree. Finally, we used script based on the Bio.Phylo 
module of Biopython [77] to reroot the tree with ePK as 



the outgroup, collapse all splits with less than 25% boot- 
strap support, colorize the specific clades of interest and 
visualize the tree. The alignment of subfamily consensus 
sequences and the inferred tree have been deposited in 
TreeBase (http://www.treebase.org/; Study ID: 14212). 

Analysis of evolutionary constraints 

To identify sites of contrasting conservation between 
ROPK subfamilies, and between all ROPKs and the 
broader protein kinase superfamily, we compared aligned 
sites between two given sequence sets by applying a multi- 
nomial log-likelihood test (G-test) [78] of the residue 
compositions of each column in the two sets. The test 
statistic G is derived from the frequencies of each amino 
acid type as observed in the "foreground" set, O/, and 
as expected based on the "background" set, £/, including 
pseudocounts taken from the amino acid frequencies of 
the full alignment. 

To adjust for the non-independence of sequences in 
each set due to phylogenetic relatedness, the aligned 
sequences in each set are weighted according to the 
Henikoff heuristic [79], and the amino acid counts in each 
column are adjusted according to these sequence weights, 
an approach also used in PSI-BLAST [80]. The test statis- 
tic G follows the chi-squared distribution with 19 degrees 
of freedom (for the 20 amino acid types). 

We implemented this test in a program called 
CladeCompare, available at http://github.com/etal/ 
cladecompare. The output of the program includes (i) 
a table of the probabilities (p-values) of each site in the 
combined alignment, (ii) a list of the significantly con- 
trasting sites after adjusting for multiple testing using the 
Benjamini-Hochberg false discovery rate method [81], 
and (iii) images of paired "background" and "foreground" 
sequence logos to illustrate the contrast at significant 
sites, generated using the WebLogo [82] and ReportLab 
[83] libraries. 

Detection of the N-terminal extension in additional 
subfamilies 

To identify which ROPK subfamilies share sequence 
homology to the NTE region observed in the ROP2, ROP8 
and ROP5 structures, and suggested to be present in 
ROP18, ROP4/7 and ROP17, we used the CHAIN pro- 
gram [84] with the previously identified NTE-bearing 
sequences as the query set and the complete set of full- 
length ROPK sequences as the main set. CHAIN identi- 
fied a "foreground" partition corresponding to the clade 
highlighted in Figure 1. 

We then constructed an alignment of the sequence 
regions N-terminal to the kinase domain in the identified 
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using the "accurate" mode of T-Coffee [85], built an HMM 
profile from this alignment, and used HMMer 3.0 [70] to 
search the full-length ROPK sequences. This recovered 
the same ROPK subfamilies identified by CHAIN, con- 
firming the presence of homologous NTE regions in those 
subfamilies. 

Structural analysis 

Sites of interest were mapped onto PDB protein structures 
with a script and visualized in PyMOL [86] for manual 
inspection. 

Availability of supporting data 

The data sets supporting the results of this article 
are available in the TreeBase repository, http://purl.org/ 
phylo/treebase/phylows/study/TB2:S14212. 



Additional file 



Additional file 1 : ROPK subfamily members in coccidian genomes. 

Classification of each rhoptry kinase gene in selected genomes into the 
identified subfamilies. 

Additional file 2: Conservation contrasts in aligned ROPK subfamily 
consensus sequences. Alignment of the ROPK family consensus and the 
consensus sequences of each subfamily. The first row in each block 
indicates contrasting features of the ROPK family sequence profile versus 
the protein kinase superfamily, while all other rows indicate patterns 
specific to each subfamily profile. Profile-specific inserts are shown in 
yellow, relative deletions in gray. The statistical significance of the contrast 
in residue composition at each site is shown as a heat map, with 
non-significant sites in shades of blue, p-values between 0.05 and 0.01 in 
white, and p-values less than 0.01 in increasingly dark shades of red. 
Significant sites in the ROPK-PK comparison are indicated with asterisks 
along the top of the alignment. This visualization was generated by 
CladeCompare version 0.1 .0. 

Additional file 3: ROPK subfamily HMM profile database. HMMer 3.0 
database containing profiles for each individual ROPK subfamily, the ROPK 
family as a whole, and typical protein-kinase-like (PKL) sequences which 
can be used to distinguish ROPKs from other protein kinases. 
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